In order to understand how Naive Bayes classifiers work, we first need to briefly recapitulate Bayes’ rule. The probability model formulated by Thomas Bayes (1701-1761) is simple yet powerful; in words, it can be written as follows:

Posterior Probability = (Conditional Probability * Prior Probability) / Evidence

Bayes’ theorem forms the core of Naive Bayes classification. The posterior probability, in the context of a classification problem, can be interpreted as: “What is the probability that a particular object belongs to a class given its observed feature values?”

A more concrete example would be: “What is the probability that a person has diabetes given a certain value for a pre-breakfast blood glucose measurement and a certain value for a post-breakfast blood glucose measurement?”

P(diabetes ∣ xi), where xi = [90 mg/dl, 145 mg/dl]

Let ωj denote the j-th class label (here, diabetes or not-diabetes) and xi the feature vector of the i-th sample. The general notation of the posterior probability can then be written as

P(ωj ∣ xi) = P(xi ∣ ωj) ⋅ P(ωj) / P(xi)
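To make the formula concrete, here is a small numeric sketch in Python. All probabilities below (the assumed prevalence and the two likelihoods) are invented for illustration; the point is only how the three terms of Bayes’ rule combine.

```python
# Bayes' rule with made-up numbers: posterior = likelihood * prior / evidence.
prior_diabetes = 0.1   # P(diabetes), an assumed prevalence
prior_healthy = 0.9    # P(not-diabetes)
lik_diabetes = 0.85    # P(xi | diabetes), assumed likelihood of the glucose readings
lik_healthy = 0.05     # P(xi | not-diabetes)

# Evidence P(xi) via the law of total probability.
evidence = lik_diabetes * prior_diabetes + lik_healthy * prior_healthy

posterior_diabetes = lik_diabetes * prior_diabetes / evidence
print(round(posterior_diabetes, 3))  # 0.654
```

Note that the evidence P(xi) is the same for every class, which is why it can be dropped when we only compare posteriors.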

The objective in Naive Bayes classification is to maximize the posterior probability given the training data in order to formulate the decision rule.

To continue with our example above, we can formulate the decision rule based on the posterior probabilities as follows:

A person has diabetes if

P(diabetes ∣ xi) ≥ P(not-diabetes ∣ xi),

otherwise the person is classified as healthy.
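The decision rule above can be sketched in plain Python. This is a minimal Gaussian Naive Bayes: the class-conditional likelihood of each feature is modeled as a normal distribution, and the naive independence assumption lets us multiply the per-feature likelihoods. All means, standard deviations, and priors are made up for illustration.

```python
# A minimal sketch of the Naive Bayes decision rule with Gaussian likelihoods.
# All parameter values are invented for illustration.
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Assumed class-conditional parameters for (pre-, post-breakfast) glucose in mg/dl.
params = {
    "diabetes":     {"means": (110.0, 160.0), "stds": (15.0, 20.0), "prior": 0.1},
    "not-diabetes": {"means": (85.0, 120.0),  "stds": (10.0, 15.0), "prior": 0.9},
}

def posterior_scores(x):
    # Naive assumption: features are independent given the class,
    # so the per-feature likelihoods multiply.
    scores = {}
    for label, p in params.items():
        likelihood = 1.0
        for xi, mean, std in zip(x, p["means"], p["stds"]):
            likelihood *= gaussian_pdf(xi, mean, std)
        scores[label] = likelihood * p["prior"]  # proportional to the posterior
    return scores

def classify(x):
    scores = posterior_scores(x)
    return "diabetes" if scores["diabetes"] >= scores["not-diabetes"] else "healthy"

print(classify([90.0, 145.0]))  # healthy
```

This is essentially what a library implementation such as scikit-learn’s `GaussianNB` estimates from data, except that there the means, standard deviations, and priors are fitted from the training set rather than assumed.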

- The first disadvantage is that the Naive Bayes classifier makes a very strong assumption about the shape of your data distribution: any two features are assumed to be independent given the output class. Because of this assumption, the estimated probabilities can be poorly calibrated – hence, a “naive” classifier. This is not as terrible as people generally think, because the NB classifier can be optimal even when the assumption is violated.
- Another problem arises from data scarcity. For any possible value of a feature, you need to estimate a likelihood by a frequentist (counting) approach. This can result in probabilities going towards 0 or 1, which in turn leads to numerical instabilities and worse results. In this case, you need to smooth your probabilities in some way (e.g. as in sklearn), or impose some prior on your data; however, you may argue that the resulting classifier is not naive anymore.
- A third problem arises with continuous features. It is common to use a binning procedure to make them discrete, but if you are not careful you can throw away a lot of information.
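The smoothing mentioned in the second point can be illustrated with Laplace (add-one) smoothing, which is the technique behind the `alpha` parameter of scikit-learn’s multinomial/categorical NB estimators. The counts below are invented; the point is that a feature value never seen for a class still receives a small nonzero likelihood instead of an unstable zero.

```python
# Laplace (add-alpha) smoothing for categorical likelihoods: a sketch.
# The counts are invented for illustration.
from collections import Counter

def smoothed_likelihoods(counts, n_values, alpha=1.0):
    """P(value | class) with add-alpha smoothing over n_values possible values."""
    total = sum(counts.values())
    return {v: (counts.get(v, 0) + alpha) / (total + alpha * n_values)
            for v in range(n_values)}

# A feature with 3 possible values; value 2 was never observed in this class.
counts = Counter({0: 4, 1: 6})
probs = smoothed_likelihoods(counts, n_values=3)
print(probs)  # value 2 gets a small but nonzero probability (1/13)
```

With alpha = 0 the unseen value would get probability 0 and zero out the whole product of likelihoods; with alpha > 0 every value keeps some mass, at the cost of effectively imposing a uniform prior.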
