Could you explain what we should do if we have a dataset that is not balanced? For example, a binary classification problem with 1000 negative and 6000 positive cases? I’d love to know how to deal with it. Should we use something like 10-fold cross-validation? If so, how do we implement it in Python? If we can cover this topic during the following lectures, that would be great! Thank you.
This is called class imbalance. First, if this ratio reflects the distribution of data in the real-world scenario where the model will be used, then it’s OK to have a skewed distribution. If not, then the best solution is almost always to collect more data for the underrepresented class. If that’s not possible, then you can try some techniques to address the imbalance: https://www.analyticsvidhya.com/blog/2017/03/imbalanced-data-classification/
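On the cross-validation part of the question: 10-fold cross-validation does not fix the imbalance by itself, but a stratified split at least keeps the same class ratio in every fold, so each evaluation sees both classes in realistic proportions. Below is a minimal sketch using only the Python standard library. The function name `stratified_k_fold` and the toy labels are made up for illustration; in practice scikit-learn's `StratifiedKFold` does the same job.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep the class ratio.

    Hypothetical helper written for this example, not a library function.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into k folds
    # so every fold receives its proportional share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Toy labels matching the question: 1000 negatives (0) and 6000 positives (1).
labels = [0] * 1000 + [1] * 6000
splits = list(stratified_k_fold(labels, k=10))

# Count how many samples of each class land in every test fold.
fold_counts = [Counter(labels[i] for i in test_idx) for _, test_idx in splits]
print(fold_counts[0])
```

With these counts, every test fold ends up with 100 negatives and 600 positives, the same 1:6 ratio as the full dataset. You would then train on `train_idx` and evaluate on `test_idx` for each of the 10 splits, ideally with a metric that is not dominated by the majority class (for example precision and recall rather than plain accuracy).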