A quantization method based on hardware of in-memory computing and a system thereof. The quantization method includes a quantization parameter providing step, a parameter splitting step, a multiply-accumulate step, a convolution quantization step and a convolution merging step. The quantization parameter providing step is performed to provide a quantization parameter, and the quantization parameter includes a quantized input activation, a quantized weight and a splitting value. The parameter splitting step is performed to split the quantized weight and the quantized input activation into a plurality of grouped quantized weights and a plurality of grouped activations, respectively, according to the splitting value. The multiply-accumulate step is performed to execute a multiply-accumulate operation with one of the grouped quantized weights and one of the grouped activations, and then generate a convolution output. The convolution quantization step is performed to quantize the convolution output into a quantized convolution output according to a convolution target bit. The convolution merging step is performed to execute a partial-sum operation with the quantized convolution output according to the splitting value, and then generate an output activation. Therefore, the quantization method of the present disclosure considers the hardware limitations of nonvolatile in-memory computing (nvIMC) to implement compact convolutional neural networks (CNNs). The nvIMC is simulated for parallel computation of multilevel matrix-vector multiplications (MVMs) under the constraints of an analog-to-digital converter (ADC). A concrete-distribution-based quantization method is introduced to mitigate the small-read-margin problem caused by variations in nvIMC, so as to obtain better updated weights.
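To make the flow of the five steps concrete, the following minimal NumPy sketch emulates one grouped matrix-vector multiplication under the described scheme. It is an illustration only: the function names, tensor shapes, bit widths, and the uniform quantizer (standing in for the disclosure's concrete-distribution-based method and the ADC model) are all assumptions, not the claimed implementation.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Hypothetical uniform quantizer; a stand-in for the disclosure's
    concrete-distribution-based quantization."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    return np.round(x / scale) * scale

def quantized_conv(q_weight, q_activation, splitting_value, conv_target_bit):
    """Sketch of the steps after the quantization parameter providing step:
    q_weight and q_activation are assumed to be the already-quantized
    weight and input activation."""
    # Parameter splitting step: split along the input dimension according
    # to the splitting value (e.g. the number of word lines an nvIMC macro
    # can activate in parallel).
    w_groups = np.array_split(q_weight, splitting_value, axis=0)
    a_groups = np.array_split(q_activation, splitting_value, axis=0)

    output_activation = 0.0
    for w_g, a_g in zip(w_groups, a_groups):
        # Multiply-accumulate step: grouped MVM, one group per simulated
        # nvIMC array.
        conv_output = a_g @ w_g
        # Convolution quantization step: requantize the group output to the
        # convolution target bit, mimicking the ADC's limited resolution.
        q_conv_output = uniform_quantize(conv_output, conv_target_bit)
        # Convolution merging step: partial-sum the quantized group outputs.
        output_activation = output_activation + q_conv_output
    return output_activation

# Usage with illustrative shapes and bit widths (all values hypothetical).
rng = np.random.default_rng(0)
q_w = uniform_quantize(rng.standard_normal((64, 16)), bits=4)  # quantized weight
q_a = uniform_quantize(rng.standard_normal(64), bits=4)        # quantized input activation
out = quantized_conv(q_w, q_a, splitting_value=8, conv_target_bit=8)
print(out.shape)  # (16,)
```

Requantizing each group's output before the partial-sum reflects the point in the flow where the ADC sits in real nvIMC hardware; merging unquantized group outputs would hide the read-margin effect the method is designed to address.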