Abstract

We provide a methodological contribution by developing, describing and evaluating a method for automatically retrieving and analysing text from digital PDF annual report files published by firms listed on the London Stock Exchange (LSE). The retrieval method retains information on document structure, enabling clear delineation between narrative and financial statement components of reports, and between individual sections within the narratives component. Retrieval accuracy exceeds 95% for manual validations using a random sample of 586 reports. Large-sample statistical validations using a comprehensive sample of reports published by non-financial LSE firms confirm that report length, narrative tone and (to a lesser degree) readability vary predictably with economic and regulatory factors. We demonstrate how the method is adaptable to non-English language documents and different regulatory regimes using a case study of Portuguese reports. We use the procedure to construct new research resources including corpora for commonly occurring annual report sections and a dataset of text properties for over 26,000 U.K. annual reports.

Highlights

  • Annual reports provide important information to support decision-making (EY 2015: 6, CFA Society U.K. 2016). Extant large-sample automated analysis of annual report commentaries focuses almost entirely on Form 10-K filings for U.S. registrants accessed through the Securities and Exchange Commission’s (SEC) EDGAR system (El-Haj et al. 2019)

  • We provide a methodological contribution by developing, describing and evaluating an automated procedure for retrieving and classifying the narrative component of glossy annual reports presented as digital PDF files

  • In particular and consistent with Lang and Stice-Lawrence (2015), we show how annual report length increased for London Stock Exchange (LSE) Main Market (Alternative Investment Market) firms following mandatory adoption of International Financial Reporting Standards in 2005 (2007)


Summary

Introduction

Annual reports provide important information to support decision-making (EY 2015: 6, CFA Society U.K. 2016). Extant large-sample automated analysis of annual report commentaries focuses almost entirely on Form 10-K filings for U.S. registrants accessed through the Securities and Exchange Commission’s (SEC) EDGAR system (El-Haj et al. 2019). Results reveal how text attributes correlate predictably with regulatory features and managers’ reporting incentives, and how higher quality disclosures are associated with positive stock market outcomes. They extract text from unstructured PDF English-language reports by converting files to ASCII format using Xpdf and QPDF proprietary software and construct aggregate measures of the entire textual content of glossy annual reports. Our approach involves identifying a set of common section titles and associated synonyms based on an initial sample of 50 reports selected at random. We use this provisional list of headers to identify the contents page by matching the text on each page of the document against our key-phrase list. Reports indicating that document structure and section-level text retrieval is based on document bookmarks rather than the report table of contents.
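The contents-page step described above can be illustrated with a minimal Python sketch. It assumes the report has already been converted to one text string per page. The key-phrase list, the function names, the scoring rule (count of distinct phrases found on a page) and the `min_hits` threshold are all hypothetical choices made for illustration; they are not the list or decision rule used in the paper.

```python
import re

# Hypothetical key-phrase list for illustration only; the paper's actual list
# of section titles and synonyms is not reproduced here.
SECTION_PHRASES = [
    "chairman's statement",
    "chief executive's review",
    "financial review",
    "corporate governance",
    "directors' report",
    "remuneration report",
    "independent auditor's report",
    "notes to the financial statements",
]


def normalise(text):
    """Lowercase, unify curly apostrophes, and collapse whitespace."""
    text = text.lower().replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip()


def count_phrase_hits(page_text, phrases=SECTION_PHRASES):
    """Count how many distinct key phrases appear on a page."""
    page = normalise(page_text)
    return sum(1 for phrase in phrases if normalise(phrase) in page)


def find_contents_page(pages, phrases=SECTION_PHRASES, min_hits=3):
    """Return the 0-based index of the likeliest contents page, or None.

    The page matching the most distinct phrases wins; ties go to the
    earliest page. min_hits is an assumed threshold, not a published value.
    """
    best_index, best_hits = None, 0
    for index, page_text in enumerate(pages):
        hits = count_phrase_hits(page_text, phrases)
        if hits > best_hits:
            best_index, best_hits = index, hits
    return best_index if best_hits >= min_hits else None


# Toy three-page "report": cover, contents page, first narrative section.
example_pages = [
    "Annual Report and Accounts 2015",
    "Contents\nChairman\u2019s statement 2\nFinancial review 6\n"
    "Corporate governance 20\nDirectors' report 30",
    "Chairman's statement\nThis was a year of progress.",
]
contents_index = find_contents_page(example_pages)
```

Counting distinct phrases rather than total occurrences keeps a page that repeats a single heading many times from outscoring a genuine contents page, which typically lists many different section titles once each.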

Classification
Text processing
Statistical evaluation
Other sections
12. Primary financial statements
