In both the linear regression and logistic regression chapters, the author wrote a fit() function to train the model, in which he loops over the samples batch by batch, over and over, until all epochs are finished (I understand that this runs much faster than processing everything in a single batch). When he computed the loss, he used only the samples in the current batch, and then computed gradients from that batch loss. But I thought the ultimate goal is to find a set of parameters that best describes the whole dataset (minimizing the total loss across all samples), not to fit the samples batch by batch.
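To make sure I'm describing the pattern correctly, here is a minimal NumPy sketch of the kind of fit() loop I mean (my own function and variable names, not the author's actual code), using mean squared error for linear regression:

```python
import numpy as np

def fit(X, y, lr=0.5, epochs=300, batch_size=32, seed=0):
    """Mini-batch gradient descent for linear regression (MSE loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        idx = rng.permutation(n)            # reshuffle samples each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # gradient of the *batch* loss only, not the full-dataset loss
            err = Xb @ w + b - yb
            w -= lr * (2 * Xb.T @ err / len(batch))
            b -= lr * (2 * err.mean())
    return w, b

# tiny synthetic check: data generated from y = 3x + 1
X = np.linspace(0, 1, 64).reshape(-1, 1)
y = 3 * X[:, 0] + 1
w, b = fit(X, y)
```

So each parameter update uses the gradient of only one batch's loss, even though the quantity I care about is the loss over all 64 samples.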
I'm a little confused by this method. Since the samples differ from batch to batch, each batch will push the parameters in a slightly different direction (I understand that shuffling the samples before creating the batches helps with this). As a result, wouldn't the parameters end up predicting the batches trained later better than the batches trained earlier? Is this correct, or am I misunderstanding something?