General question about gradient descent with batches

In linear regression and logistic regression, the author created a function called fit() to train the model. In it, he loops through the samples batch by batch, over and over, until all the epochs are finished (I understand that the program runs a lot faster this way than with the whole dataset as a single batch). When he calculated the loss, he only used the samples in the current batch, and then computed the gradients from that batch loss. I thought the ultimate goal was to find a set of parameters that best describes the whole dataset (minimizing the total loss across all samples), rather than fitting the samples batch by batch.
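
For concreteness, here is a minimal sketch of what I understand such a fit() loop to look like (the names and hyperparameters are my own illustration, not the author's actual code):

```python
import numpy as np

def fit(X, y, lr=0.01, batch_size=32, epochs=100):
    """Mini-batch gradient descent for linear regression with MSE loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for epoch in range(epochs):
        # Shuffle once per epoch so the batches differ between epochs
        idx = np.random.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Loss and gradients are computed on this batch only
            error = Xb @ w + b - yb
            grad_w = 2 * Xb.T @ error / len(batch)
            grad_b = 2 * error.mean()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b
```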

I’m a little confused by this method. Since the samples differ from batch to batch, each batch will pull the parameters in a slightly different direction (I understand that shuffling the samples before creating the batches helps). As a result, won’t the parameters predict the batches trained later better than the batches trained earlier? Is this correct, or am I misunderstanding something?

I think a smaller batch size also means more gradient updates per epoch. Although a single batch may not be completely representative of the optimal direction to step in, its gradient is an approximation of the full-dataset gradient. The larger number of gradient updates allows the model to converge a lot faster. At the end of the day, I think it is a compute-time trade-off.
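
If it helps, here is a quick numerical check (my own toy setup, not from the book): with equal-sized batches that partition the data, the full-batch MSE gradient is exactly the average of the per-batch gradients, so each batch gradient is a noisy but unbiased estimate of the direction that minimizes the total loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(3)

def grad(Xb, yb, w):
    # MSE gradient on a subset of the data
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)
batches = [grad(X[i:i + 100], y[i:i + 100], w) for i in range(0, 1000, 100)]
print(np.allclose(full, np.mean(batches, axis=0)))  # True
```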


Thanks a lot, that clears up my confusion.