The Securities Technology Analysis Center (STAC) designed a study to illustrate how its benchmarks for machine learning (ML) can be constructed and used. The study is intended to help data scientists and data engineers understand what to expect from the data science tools and cloud products used in this project, and how to avoid common pitfalls.
The workload is topic modeling of SEC Form 10-K filings, a natural language processing (NLP) task, performed with Latent Dirichlet Allocation (LDA).
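As a rough illustration of what LDA does, the following is a minimal sketch of a collapsed Gibbs sampler in plain Python. It is not the implementation used in the study. The toy documents, the function name `lda_gibbs`, and the hyperparameter values are all assumptions chosen for illustration; a real 10-K corpus would need tokenization, stop-word removal, and a scalable library.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over pre-tokenized documents.

    alpha and beta are the symmetric Dirichlet priors on the document-topic
    and topic-word distributions (illustrative values, not from the study).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word v is assigned to topic k
    topic_total = [0] * num_topics                              # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                # Remove the token's current assignment from the counts.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][v] -= 1
                topic_total[k_old] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][v] += 1
                topic_total[k_new] += 1

    # Smoothed per-document topic mixtures and the top words of each topic.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[v] for v in sorted(range(vocab_size), key=lambda v: -topic_word[k][v])[:3]]
        for k in range(num_topics)
    ]
    return theta, top_words


# Made-up token lists standing in for filing text (hypothetical data).
toy_docs = [
    "revenue growth profit margin revenue".split(),
    "litigation risk regulatory lawsuit risk".split(),
    "profit revenue margin growth".split(),
    "lawsuit regulatory litigation risk".split(),
]
theta, top_words = lda_gibbs(toy_docs, num_topics=2)
for d, mix in enumerate(theta):
    print(d, [round(p, 2) for p in mix])
print(top_words)
```

Each row of `theta` is a document's estimated mixture over topics, and `top_words` lists the most frequently assigned words per topic. Sampling is sequential over tokens, which is one reason production workloads rely on parallel or distributed variants rather than a loop like this.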
The STAC team used this workload to explore the question of scale-up versus scale-out in a cloud environment on three SUTs (Systems Under Test):
- A single Google Cloud Platform (GCP) n1-standard-16 instance with Skylake and RHEL 7.6
- A single GCP n1-standard-96 instance with Skylake and RHEL 7.6
- A Google Cloud Dataproc (Spark as a service) cluster of 13 x n1-standard-16 Skylake instances (1 master and 12 worker nodes) running Debian Linux 8
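To put the three SUTs on a common footing, the sketch below tallies their vCPU counts. It relies on the GCP naming convention in which the number in `n1-standard-N` is the vCPU count per instance. The `Sut` class and its field names are hypothetical, and treating the Dataproc master node as not contributing worker compute is an assumption, so both worker and total figures are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sut:
    """Hypothetical record describing one System Under Test."""
    name: str
    vcpus_per_node: int
    worker_nodes: int
    master_nodes: int = 0

    @property
    def worker_vcpus(self) -> int:
        return self.vcpus_per_node * self.worker_nodes

    @property
    def total_vcpus(self) -> int:
        return self.vcpus_per_node * (self.worker_nodes + self.master_nodes)


suts = [
    Sut("single n1-standard-16", vcpus_per_node=16, worker_nodes=1),
    Sut("single n1-standard-96", vcpus_per_node=96, worker_nodes=1),
    Sut("Dataproc, 13 x n1-standard-16", vcpus_per_node=16, worker_nodes=12, master_nodes=1),
]

for sut in suts:
    print(f"{sut.name}: {sut.worker_vcpus} worker vCPUs, {sut.total_vcpus} total vCPUs")
```

On this count the scale-up instance offers 96 vCPUs on one node, while the scale-out cluster offers 192 worker vCPUs (208 including the master) spread across nodes. Raw vCPU totals say nothing about memory, network overhead, or framework efficiency, which is what the comparison is meant to explore.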
The test design is a proposal to elicit feedback from the STAC AI Group on use cases, benchmark design, and research priorities around ML techniques and technologies.