Scaling and Overflow Behavior:

The function is implemented using an internal 64-bit accumulator. The accumulator has a 2.62 format and maintains full precision of the intermediate multiplication results but provides only a single guard bit. Thus, if the accumulator result overflows it wraps around rather than clip. In order to avoid overflows completely the input signal must be scaled down by log2(numTaps) bits. After all multiply-accumulates are performed, the 2.62 accumulator is right shifted by 31 bits and saturated to 1.31 format to yield the final result.
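
To make the quoted behaviour concrete, here is a rough C model of one output sample as the documentation describes it. This is only a sketch: fir_q31_model_sample is a hypothetical name, and it is not the library's actual source, just the accumulate, shift and saturate steps from the paragraph above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of one output sample, following the quoted
   description: Q1.31 x Q1.31 products summed in a 64-bit Q2.62 accumulator,
   then shifted right by 31 bits and saturated to Q1.31. */
int32_t fir_q31_model_sample(const int32_t *coeffs, const int32_t *samples,
                             size_t num_taps)
{
    /* Unsigned accumulator so that overflow wraps with defined behaviour,
       mimicking the "wraps around rather than clip" statement. */
    uint64_t acc = 0;

    for (size_t i = 0; i < num_taps; ++i) {
        int64_t prod = (int64_t)coeffs[i] * samples[i]; /* Q2.62 product */
        acc += (uint64_t)prod;
    }

    /* Reinterpret as signed (two's complement assumed), then do an
       arithmetic right shift by 31 without relying on how the compiler
       shifts negative values. */
    int64_t s = (int64_t)acc;
    int64_t shifted = (s < 0) ? ~(~s >> 31) : (s >> 31);

    /* Saturate to the Q1.31 range. */
    if (shifted > INT32_MAX) return INT32_MAX;
    if (shifted < INT32_MIN) return INT32_MIN;
    return (int32_t)shifted;
}
```

With two taps where both coefficients and both samples are at full-scale negative, the true sum is +2.0, but the accumulator wraps and the model returns the most negative Q1.31 value. That is the failure the scaling rule is meant to prevent.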

In other words, you must scale your input down by log2(numTaps) bits to guarantee the accumulator cannot overflow. With 128 taps, for example, you would need to shift all your data right by log2(128) = 7 bits. That's why I said @CousinItt's suggestion of using Q7.24 to avoid overflow is so useful: a value in [-1, 1) stored as Q7.24 is the same as its Q1.31 representation shifted right by 7 bits, which is exactly the headroom 128 taps require.
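
As a minimal sketch of that pre-scaling step in C: headroom_bits and scale_down_q31 are hypothetical helper names, not CMSIS-DSP functions, and rounding the shift up for tap counts that are not a power of two is my own assumption rather than something the documentation states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest shift s with (1 << s) >= num_taps,
   i.e. ceil(log2(num_taps)). Rounding up is an assumption for tap
   counts that are not a power of two. */
unsigned headroom_bits(uint32_t num_taps)
{
    assert(num_taps > 0);
    unsigned bits = 0;
    while (((uint64_t)1 << bits) < num_taps) {
        ++bits;
    }
    return bits;
}

/* Hypothetical helper: scale a Q1.31 buffer down in place by `shift`
   bits (shift < 32). Written so the result is an arithmetic shift
   regardless of how the compiler shifts negative values. */
void scale_down_q31(int32_t *buf, size_t len, unsigned shift)
{
    for (size_t i = 0; i < len; ++i) {
        int32_t x = buf[i];
        buf[i] = (x < 0) ? ~(~x >> shift) : (x >> shift);
    }
}
```

For a 128-tap filter you would call scale_down_q31(input, blockSize, headroom_bits(128)) on each block before filtering it. The price is 7 bits of input resolution, so it is worth checking whether your signal really needs the worst-case headroom.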