Context: The concept of transactions is used in Use Case Points (UCP), and in many other functional size measurement methods, to capture the smallest unit of functionality that should be considered while measuring the size of a system. Unfortunately, in the case of the UCP method at least four methods for use-case transaction identification have been proposed so far. The different approaches to transaction identification and difficulties related to the analysis of requirements expressed in natural language can lead to problems in the reliability of functional size measurement. Objective: The goal of this study was to evaluate reliability of transaction identification in use cases (with the methods mentioned in the literature), analyze their weaknesses, and propose some means for their improvement. Method: A controlled experiment on a group of 120 students was performed to investigate if the methods for transaction identification, known from the literature, provide similar results. In addition, a qualitative analysis of the experiment data was performed to investigate the potential problems related to transaction identification in use cases. During the experiment a use-case benchmark specification was used. The automatic methods for transaction identification, proposed in the paper have been validated using the same benchmark by comparing the outcomes provided by these methods to on-average number of transactions identified by the participants of the experiment. Results: A significant difference in the median number of transactions was observed between groups using different methods of transaction identification. The Kruskal-Wallis test was performed with the significance level @a set to 0.05 and followed by the post-hoc analysis performed according to the procedure proposed by Conover. Also a large intra-method variability was observed. The ratios between the maximum and minimum number of transactions identified by the participants using the same method were equal to 1.96, 3.83, 2.03, and 2.21. The proposed automatic methods for transaction identification provided results consistent with those provided by the participants of the experiment and functional measurement experts. The relative error between the number of transaction identified by the tool and on-average number of transactions identified by the participants of the experiment ranged from 3% to 7%. Conclusions: Human-performed transaction identification is error prone and quite subjective. Its reliability can be improved by automating the process with the use of natural language processing techniques.