A block-based inference method for memory-efficient convolutional neural network (CNN) implementation is proposed. The block-based inference method includes a parameter setting step, a dividing step, a block-based inference step and a temporary storing step. The parameter setting step includes setting an inference parameter group, which includes a depth, a block width, a block height and a kernel size. The dividing step includes driving a processing unit to divide an input image into a plurality of input block data according to the depth, the block width and the block height. Each of the input block data has an input block size.

The block-based inference step includes driving the processing unit to perform a multi-layer convolution operation on each of the input block data to generate an output block data. The multi-layer convolution operation includes a first direction data selecting step, a second direction data selecting step and a convolution operation step. The first direction data selecting step includes selecting a plurality of ith layer recomputing features according to a position of the output block data along a first direction, and then selecting an ith layer recomputing input feature block data according to the position of the output block data and the ith layer recomputing features, where i is a positive integer ranging from 1 to the depth. The second direction data selecting step includes selecting a plurality of ith layer reusing features along a second direction according to the ith layer recomputing input feature block data, and then combining the ith layer recomputing input feature block data with the ith layer reusing features to generate an ith layer reusing input feature block data. The convolution operation step includes selecting a plurality of ith layer sub-block input feature groups from the ith layer reusing input feature block data according to an ith layer kernel size, performing a convolution operation on each of the ith layer sub-block input feature groups to generate an ith layer sub-block output feature, and then combining the ith layer sub-block output features corresponding to the ith layer sub-block input feature groups to form an ith layer output feature block data.

The temporary storing step includes driving a block buffer bank to store the ith layer output feature block data and the ith layer reusing features.

Therefore, the present disclosure reuses the features along the block scanning direction to reduce recomputing overhead and recomputes the features between different scan lines to eliminate the global line buffer, so that the inference flow can provide great flexibility and a good tradeoff between computing and memory overheads for high-performance, memory-efficient CNN inference.
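As an illustration of the recompute-and-reuse flow described above, the following minimal Python sketch processes a single-channel image through a stack of valid convolutions block by block. It is a simplification, not the disclosure's exact implementation: it assumes one 2-D kernel per layer, stride 1, and output dimensions exactly divisible by the block size, and the names (block_inference, conv2d_valid) are illustrative. Along the first direction (block rows), the vertical halo rows are refetched from the image and recomputed for every scan line, so no global line buffer is kept; along the second direction (within a scan line), each layer's rightmost kernel-width-minus-one input columns are held in a small per-layer buffer bank and prepended to the next block, which corresponds to the reusing features.

```python
import numpy as np

def conv2d_valid(x, w):
    # Plain single-channel "valid" convolution (no padding, stride 1).
    k = w.shape[0]
    H, W = x.shape
    out = np.empty((H - k + 1, W - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = (x[r:r + k, c:c + k] * w).sum()
    return out

def block_inference(image, kernels, Bh, Bw):
    # Total halo of the layer stack: each k x k layer consumes k - 1 pixels.
    T = sum(w.shape[0] - 1 for w in kernels)
    H_out, W_out = image.shape[0] - T, image.shape[1] - T
    assert H_out % Bh == 0 and W_out % Bw == 0, "sketch assumes exact tiling"
    out = np.empty((H_out, W_out))
    for br in range(H_out // Bh):          # first direction: block rows
        # Recompute between scan lines: the per-layer buffers are reset, so
        # the vertical halo is refetched from the image for every line and
        # no global line buffer is needed.
        buffers = [None] * len(kernels)    # per-layer block buffer bank
        row0 = br * Bh
        for bc in range(W_out // Bw):      # second direction: scan line
            if bc == 0:
                # First block of the line carries the full horizontal halo.
                feat = image[row0:row0 + Bh + T, 0:Bw + T]
            else:
                # Later blocks fetch only Bw new image columns.
                col0 = bc * Bw + T
                feat = image[row0:row0 + Bh + T, col0:col0 + Bw]
            for i, w in enumerate(kernels):
                k = w.shape[0]
                if bc > 0:
                    # Reuse along the scan line: prepend the (k - 1) columns
                    # buffered from the previous block at this layer.
                    feat = np.concatenate([buffers[i], feat], axis=1)
                # Keep the rightmost (k - 1) input columns for the next block.
                buffers[i] = feat[:, feat.shape[1] - (k - 1):].copy()
                feat = conv2d_valid(feat, w)   # i-th layer output block
            out[row0:row0 + Bh, bc * Bw:(bc + 1) * Bw] = feat
    return out
```

The sketch can be checked against a monolithic reference that convolves the whole image at once; the blocked result matches it exactly, since the reused columns supply the same halo that the monolithic pass sees:

```python
rng = np.random.default_rng(0)
kernels = [rng.standard_normal((3, 3)) for _ in range(3)]  # depth 3, halo T = 6
image = rng.standard_normal((14, 14))                      # yields an 8 x 8 output

reference = image
for w in kernels:
    reference = conv2d_valid(reference, w)

assert np.allclose(block_inference(image, kernels, Bh=4, Bw=4), reference)
```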