The software package Qcompiler (Chen and Wang 2013) provides a general quantum compilation framework, which maps any given unitary operation into a quantum circuit consisting of a sequential set of elementary quantum gates. In this paper, we present an extended software package, OptQC, which finds permutation matrices P and Q for a given unitary matrix U such that the number of gates in the quantum circuit of $U = Q^{T} P^{T} U' P Q$ is significantly reduced, where U′ is equivalent to U up to a permutation and the quantum circuit implementation of each matrix component is considered separately. We further extend this software package to make use of high-performance computers with a multiprocessor architecture using MPI. We demonstrate its effectiveness in reducing the total number of quantum gates required for various unitary operators.

Program summary
Program title: OptQC
Catalogue identifier: AEUA_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEUA_v1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 178435
No. of bytes in distributed program, including test data, etc.: 491574
Distribution format: tar.gz
Programming language: Fortran, MPI.
Computer: Any computer with Fortran compiler and MPI library.
Operating system: Linux.
Classification: 4.15.
Nature of problem: The program aims to minimize the number of quantum gates required to implement a given unitary operation.
Solution method: The program uses a threshold-based acceptance strategy for simulated annealing to select permutation matrices P and Q for a given unitary matrix U such that the number of gates in the quantum circuit of $U = Q^{T} P^{T} U' P Q$ is minimized, where U′ is equivalent to U up to a permutation.
The decomposition of a unitary operator is performed by recursively applying the cosine–sine decomposition.
Running time: Running time increases with the size of the unitary matrix, as well as the prescribed maximum number of iterations for qubit permutation selection and the subsequent simulated annealing algorithm. Running time estimates are provided for each example in Section 4. All simulation results presented in this paper are obtained from running the program on the Fornax supercomputer managed by iVEC@UWA with Intel Xeon X5650 CPUs.
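To illustrate the solution method summarized above, the following is a minimal Python sketch of a threshold-based acceptance search over permutations. It is not the OptQC implementation, which is written in Fortran with MPI and counts gates via the recursive cosine–sine decomposition. The names `permute`, `toy_cost` and `threshold_accept`, the linear threshold schedule, and the use of a single permutation in place of the pair P and Q are all assumptions made for illustration. In particular, `toy_cost` is only a stand-in for the true gate count: it counts nonzero entries lying outside the 2x2 diagonal blocks of the permuted matrix.

```python
import random


def permute(U, p):
    """Return U' with U'[i][j] = U[p[i]][p[j]].

    This equals P U P^T for the permutation matrix P with P[i][p[i]] = 1.
    """
    n = len(U)
    return [[U[p[i]][p[j]] for j in range(n)] for i in range(n)]


def toy_cost(M, tol=1e-12):
    """Hypothetical stand-in for the gate count (NOT the OptQC cost).

    Counts nonzero entries that lie outside the 2x2 diagonal blocks.
    """
    n = len(M)
    return sum(
        1
        for i in range(n)
        for j in range(n)
        if i // 2 != j // 2 and abs(M[i][j]) > tol
    )


def threshold_accept(U, iters=2000, t0=2.0, seed=0):
    """Search for a permutation p that lowers toy_cost(permute(U, p)).

    A candidate is accepted whenever its cost increase does not exceed
    the current threshold, which shrinks linearly from t0 towards zero
    (an assumed schedule).
    """
    rng = random.Random(seed)
    n = len(U)
    p = list(range(n))
    cur = toy_cost(permute(U, p))
    best_p, best = p[:], cur
    for k in range(iters):
        threshold = t0 * (1.0 - k / iters)
        # Propose a neighbour by swapping two positions of the permutation.
        q = p[:]
        i, j = rng.sample(range(n), 2)
        q[i], q[j] = q[j], q[i]
        c = toy_cost(permute(U, q))
        if c - cur <= threshold:
            p, cur = q, c
            if cur < best:
                best_p, best = p[:], cur
    return best_p, best


if __name__ == "__main__":
    # A 4x4 permutation matrix (X tensor I) used as a small test input.
    U = [
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 0],
        [0, 1, 0, 0],
    ]
    best_p, best = threshold_accept(U)
    print("permutation:", best_p, "toy cost:", best)
```

Accepting moves that worsen the cost by no more than a shrinking threshold lets the search leave shallow local minima early on while behaving like a greedy descent towards the end. The best permutation seen is tracked separately, so the returned cost is never higher than that of the identity permutation.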