$\begingroup$

How should I handle a mass-point in the dependent variable when running OLS regression in R?

I’m working with a household expenditure dataset (Living Costs 2019) where the dependent variable is the weekly combined gas and electricity bill. One issue is that about 40% of the 5008 households report exactly the same value: £2.2428 per week.

All of these cases correspond to households whose payment method is coded as “Not Direct Debit”. For these households the energy expenditure variable has no variation at all: every value is £2.2428.

Although these records show the same figure, the households are clearly distinct: they differ on the roughly 30 other variables in the dataset.

My current plan is to estimate two models:

Full sample OLS model (all households, including the mass-point)

Restricted sample OLS model (excluding all households where expenditure = £2.2428)

The idea is that the first model reflects population-wide differences including tariff/payment effects, while the second model isolates variation among households with meaningful, continuous expenditure data.
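A minimal sketch of the two fits, run on simulated data rather than the real survey. The data frame `hh` and the columns `energy`, `income`, `hh_size` and `payment` are hypothetical stand-ins, and the data-generating process is made up purely so the code runs.

```r
# Simulated stand-in for the survey data; all names and numbers are hypothetical.
set.seed(1)
n <- 500
hh <- data.frame(
  income  = rnorm(n, mean = 600, sd = 150),
  hh_size = sample(1:5, n, replace = TRUE),
  payment = sample(c("Direct Debit", "Not Direct Debit"), n,
                   replace = TRUE, prob = c(0.6, 0.4))
)
hh$energy <- 10 + 0.01 * hh$income + 2 * hh$hh_size + rnorm(n, sd = 3)

# Reproduce the mass point: every "Not Direct Debit" household gets 2.2428.
mass_point <- 2.2428
hh$energy[hh$payment == "Not Direct Debit"] <- mass_point

# Model 1: full sample, mass point included.
fit_full <- lm(energy ~ income + hh_size, data = hh)

# Model 2: restricted sample, mass point excluded.
# A tolerance is used instead of == to avoid floating-point surprises.
is_mass <- abs(hh$energy - mass_point) < 1e-6
hh_restricted <- hh[!is_mass, ]
fit_restricted <- lm(energy ~ income + hh_size, data = hh_restricted)

# Coefficient estimates side by side.
print(round(cbind(full = coef(fit_full),
                  restricted = coef(fit_restricted)), 4))
```

Comparing the two columns shows how much the mass point alone moves the estimates in this artificial setup. It says nothing about whether either model is appropriate for the real data.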

My questions are:

Is this two-model approach statistically valid for handling a dependent variable with a large mass-point?

Are there better practices in R for dealing with this kind of issue?

Should the payment-method variable be included as an explanatory variable in the full model, or does this introduce its own bias because the mass-point is entirely determined by that category?
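The concern in the last question can be made concrete with a small simulation. This is only a sketch: the data are invented, `hh`, `energy`, `income` and `payment` are hypothetical names, and a single covariate is used to keep the algebra transparent.

```r
# Simulated data; names and parameter values are hypothetical.
set.seed(2)
n <- 400
hh <- data.frame(
  income  = rnorm(n, mean = 600, sd = 150),
  payment = factor(sample(c("Direct Debit", "Not Direct Debit"), n,
                          replace = TRUE, prob = c(0.6, 0.4)))
)
hh$energy <- 10 + 0.02 * hh$income + rnorm(n, sd = 3)
# Mass point: expenditure is constant within "Not Direct Debit".
hh$energy[hh$payment == "Not Direct Debit"] <- 2.2428

# Restricted sample: Direct Debit households only.
fit_restricted <- lm(energy ~ income, data = hh,
                     subset = payment == "Direct Debit")

# Full sample with payment method as an additive dummy.
fit_dummy <- lm(energy ~ income + payment, data = hh)

# Full sample with payment method interacted with income.
fit_interact <- lm(energy ~ income * payment, data = hh)

# Income slope under each specification
# ("Direct Debit" is the reference level of the factor).
slopes <- c(
  restricted = unname(coef(fit_restricted)["income"]),
  dummy      = unname(coef(fit_dummy)["income"]),
  interacted = unname(coef(fit_interact)["income"])
)
print(slopes)
```

In this setup the additive dummy absorbs the level difference, but the income slope is a pooled within-group slope, so the flat mass-point group pulls it toward zero. The fully interacted fit gives the mass-point group its own (flat) line and therefore reproduces the restricted-sample slope for the Direct Debit households.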

Any advice on best practice for handling this structure in OLS or alternative models would be appreciated.

$\endgroup$
  • $\begingroup$ Welcome to CV. Why does "not direct debit" always mean the same amount? I admit I am totally unfamiliar with how this works, even in the US (where I live) much less the UK, but it sounds like it might be a missing data problem. That would very much affect the answer. $\endgroup$ Commented yesterday
  • $\begingroup$ That number 2.2428 looks very strange. Why does it have four decimals? Is it a calculated value based on monthly or quarterly payments? Also, I have zero understanding of gas and electricity prices in the UK, but less than 10 GBP sounds extremely low. So I agree with @PeterFlom: the first order of business is to understand your data. $\endgroup$ Commented yesterday
  • $\begingroup$ Hi Peter and Stephan. It's very strange indeed, but it's from the raw data file that I used for this analysis assignment. $\endgroup$ Commented yesterday
  • $\begingroup$ If 40% of a feature's values are errors, then my advice would not be to look for a good model, but to investigate what is happening here. $\endgroup$ Commented yesterday
  • $\begingroup$ "If I had six hours to chop down a tree, I would spend four of them sharpening my axe." Abraham Lincoln. It's time to sharpen your axe by figuring out what is going on. After that, it may be time to abandon this problem: GIGO (garbage in, garbage out). $\endgroup$ Commented 23 hours ago

1 Answer

$\begingroup$

This answer is in three parts, after reading the comments above.

First, spend a long while figuring out what is going on and, if you can't figure it out, stop working on this problem. Deleting 40% of your data because you don't understand it is not a good option.
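As a starting point for that investigation, something like the following might help. It is a sketch on simulated data: `hh`, `energy` and `payment` are placeholder names, not the real survey's columns, and the checks are only the obvious first ones.

```r
# Simulated stand-in for the survey; names and values are hypothetical.
set.seed(42)
n <- 200
hh <- data.frame(
  payment = sample(c("Direct Debit", "Not Direct Debit"), n,
                   replace = TRUE, prob = c(0.6, 0.4))
)
hh$energy <- ifelse(hh$payment == "Not Direct Debit",
                    2.2428, round(rnorm(n, mean = 25, sd = 6), 2))

# 1. What is the most frequent value, and what share of the sample has it?
freq <- sort(table(hh$energy), decreasing = TRUE)
modal_value <- as.numeric(names(freq)[1])
modal_share <- as.numeric(freq[1]) / nrow(hh)
print(c(modal_value = modal_value, modal_share = modal_share))

# 2. Does the modal value line up exactly with one payment category?
at_mode <- hh$energy == modal_value
print(table(payment = hh$payment, at_mode = at_mode))

# 3. Is there any variation at all within each payment category?
print(tapply(hh$energy, hh$payment, sd))

# 4. Could the weekly figure be a rescaled monthly, quarterly or annual amount?
print(modal_value * c(monthly = 52 / 12, quarterly = 13, annual = 52))
```

If the cross-tabulation shows a perfect match between one category and the modal value, the next place to look is the survey documentation for that category, for example whether the figure is imputed or a placeholder.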

Second, if you find out that those numbers are some kind of error, and the error can't be fixed, then you might just delete those cases, but I'd be inclined to stop the analysis. Too much wrong data will make any results extremely dubious.

Third, if you somehow find out that this is the right figure (I'm not sure how that would occur) then see this thread with a number of possible solutions.

$\endgroup$
