The present invention includes the following steps: loading a master policy, a plurality of sub-policies, and environment data; wherein each of the sub-policies has a
different inference cost; selecting one of the sub-policies as a selected sub-policy by using the master policy; generating at least one action signal according to the selected sub-policy; applying the at least one action signal to an action executing unit; detecting at least one reward signal from an environment, wherein the at least one reward signal corresponds to at least one reaction of the action executing unit executing the at least one action signal; calculating a master reward signal of the master policy according to the at least one reward signal and an inference cost from the selected sub-policy; training the master policy to decide whether to select the selected sub-policy by using the master reward signal, decreasing inference cost of a deep neural network model and outputting a satisfying result. |