1 Introduction
- We use a synthetic simulation framework to study whether different state-of-the-art RL algorithms allow computing effective dynamic pricing strategies for recommerce markets in duopoly and oligopoly competition.
- We analyze how different RL algorithms perform compared with rule-based benchmark strategies in different market scenarios and evaluate the associated steady states.
- We use self-play to identify strategies that achieve competitive results against strategies not seen in training.
- We study the impact of different model parameters and information structures on the performance of the RL algorithms and on the associated average prices, sales, stock levels, and resource flows.
- We demonstrate how to calibrate synthetic environments from data, which allows pre-training RL agents before applying them to incompletely known environments.
- We provide a rich open-source simulation and evaluation framework (see the code repository); an illustrative sketch of such a pricing environment follows this list.
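The sketch below illustrates how a recommerce pricing market of this kind can be wrapped as a step-based RL environment. It is a minimal, hypothetical example and not the framework's actual API: the class and method names, the placeholder linear demand, and the constant resale rate are assumptions; the model described in Section 3.2 uses a preference-based consumer choice instead.

```python
# Hypothetical sketch of a step-based recommerce pricing environment.
# Names and the simplified demand model are illustrative assumptions only.
import numpy as np


class RecommerceDuopolyEnv:
    """One firm sets (p_new, p_used, p_rebuy) each period; a competitor reacts."""

    def __init__(self, n_customers=20, c_virgin=3, c_inv=0.1, p_max=10, episode_length=500):
        self.n_customers = n_customers    # B: customers visiting the store per step
        self.c_virgin = c_virgin          # production price for new products
        self.c_inv = c_inv                # holding cost per stored used product
        self.p_max = p_max                # prices are chosen from [0, p_max]
        self.episode_length = episode_length
        self.reset()

    def reset(self):
        self.t = 0
        self.own_stock = 0                            # used products held by the agent
        self.competitor_prices = np.full(3, self.p_max / 2)
        return self._observation()

    def _observation(self):
        return np.array([self.own_stock, *self.competitor_prices], dtype=np.float32)

    def step(self, action):
        p_new, p_used, p_rebuy = np.clip(action, 0, self.p_max)
        # Placeholder demand, linearly decreasing in price; the paper's model
        # instead lets consumers choose between new, used, and competitor offers.
        sales_new = int(self.n_customers * max(0.0, 1 - p_new / self.p_max) * 0.3)
        sales_used = min(self.own_stock,
                         int(self.n_customers * max(0.0, 1 - p_used / self.p_max) * 0.2))
        rebuys = int(self.n_customers * 0.05)         # owners reselling their product
        self.own_stock += rebuys - sales_used
        reward = (sales_new * (p_new - self.c_virgin)   # margin on new products
                  + sales_used * p_used                 # revenue from used products
                  - rebuys * p_rebuy                    # cost of buying products back
                  - self.own_stock * self.c_inv)        # inventory holding cost
        self.t += 1
        done = self.t >= self.episode_length
        return self._observation(), reward, done, {}
```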
2 Related Work
2.1 Circular Economy
2.2 Dynamic Pricing and Market Simulations
2.3 Reinforcement Learning
3 Model and Problem Description
3.1 Overview
3.2 Model Description
3.2.1 Setup
3.2.2 A Firm’s Controls and Competitors’ Reactions
3.2.3 Consumer Behavior for Buying and Reselling
3.2.4 Problem Formulation from a Single Firm’s Perspective
3.3 Application and Selection of RL Agents
3.3.1 Embedding of RL Environments
3.3.2 Selection of RL Algorithms
4 Evaluation
4.1 Definitions and Model Specifications
4.1.1 Competitors’ Strategies
4.1.2 Consumer Arrival and Behavior
4.1.3 Reproducible Example
Symbol | Explanation | Default value |
---|---|---|
\(\delta\) | Discount factor per period | 0.99 |
\(c_{virgin}\) | Purchase or production price for new products | 3 |
\(c_{inv}\) | Price per stored used product per period (step) | 0.1 |
A | Price sets \(A_{new}=A_{used}=A_{rebuy}=A\) | [0, 10] |
\(p^{(max)}\) | Maximum price for all three price sets \(A_{new}\), \(A_{used}\), \(A_{rebuy}\) | 10 |
B | Number of customers visiting the store per step | 20 |
w | Proportion of owners considering resale per step | 0.05 |
\(\theta _{new}\) | Parameter for preference function (new items) | 0.8 |
\(\theta _{used}\) | Parameter for preference function (used items) | 0.5 |
\(\kappa _{used}\) | Parameter for preference function (used items) | 0.55 |
K | Number of competing firms | 2 |
h | Price decrease for the RBB strategy | 1 |
M | Upper reference value for used products in stock | 100 |
E | Number of periods (steps) per episode | 500 |
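For reference, the defaults listed above can be bundled into a single configuration object. The sketch below simply restates the table; the field names are illustrative and need not match the framework's actual configuration keys.

```python
# Default model parameters from the table above, collected in one place.
# Field names are illustrative assumptions, not the framework's actual keys.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MarketConfig:
    delta: float = 0.99               # discount factor per period
    c_virgin: float = 3.0             # purchase/production price for new products
    c_inv: float = 0.1                # cost per stored used product per step
    price_range: Tuple[int, int] = (0, 10)   # A_new = A_used = A_rebuy = [0, 10]
    p_max: int = 10                   # maximum price in all three price sets
    n_customers: int = 20             # B: customers visiting the store per step
    resale_share: float = 0.05        # w: share of owners considering resale per step
    theta_new: float = 0.8            # preference parameter, new items
    theta_used: float = 0.5           # preference parameter, used items
    kappa_used: float = 0.55          # preference parameter, used items
    n_firms: int = 2                  # K: number of competing firms
    rbb_decrease: int = 1             # h: price decrease used by the RBB strategy
    stock_reference: int = 100        # M: upper reference value for used products in stock
    episode_length: int = 500         # E: periods (steps) per episode
```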
4.1.4 Hyperparameters
4.1.5 Implementation Setup
4.2 Experiment A: An RL Agent Against the Rule-Based Strategy RBB in a Duopoly
4.3 Experiment B: An RL Agent Against the Rule-Based Strategy RBB in a Duopoly (Opportunistic Version with Adapted Reward Function)
4.4 Experiment C: RL vs. RL Training via Self-Play
4.5 Experiment D: Study for Different Observable State Spaces
4.6 Experiment E: Monopoly and Oligopoly Scenarios
4.6.1 Monopoly Scenario
4.6.2 Oligopoly Scenario
4.7 Ablation Study for Steady State Results
| | | Base case | B = 10 | B = 30 | \(c_{virgin}\) = 2 | \(c_{virgin}\) = 4 | w = 0.025 | w = 0.075 | K = 3 | K = 4 | K = 5 | RSS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Offer prices | \({\bar{p}}^{RL}_{new,offer}\) | 6.12 | 5.84 | 6.58 | 5.94 | 6.63 | 6.25 | 5.26 | 6.02 | 5.64 | 5.60 | 6.01 |
| | \({\bar{p}}^{C}_{new,offer}\) | 5.12 | 4.48 | 5.57 | 4.95 | 5.64 | 5.25 | 4.34 | 5.01 | 4.60 | 4.01 | 4.99 |
| | \({\bar{p}}^{RL}_{used,offer}\) | 3.92 | 4.42 | 4.12 | 4.24 | 4.01 | 3.84 | 3.26 | 3.92 | 4.15 | 3.35 | 3.93 |
| | \({\bar{p}}^{C}_{used,offer}\) | 3.34 | 3.79 | 3.56 | 3.60 | 3.64 | 3.36 | 2.68 | 2.96 | 2.82 | 2.60 | 2.92 |
| | \({\bar{p}}^{RL}_{rebuy,offer}\) | 0.23 | 0.00 | 0.15 | 0.01 | 0.42 | 0.24 | 0.10 | 0.22 | 0.42 | 0.68 | 0.41 |
| | \({\bar{p}}^{C}_{rebuy,offer}\) | 0.72 | 0.56 | 0.84 | 0.62 | 0.88 | 0.80 | 0.70 | 1.19 | 1.38 | 1.48 | 2.00 |
| Sales prices | \({\bar{p}}^{RL}_{new,sold}\) | 6.01 | 5.84 | 6.50 | 5.88 | 6.56 | 6.16 | 5.21 | 6.00 | 5.68 | 5.56 | 5.85 |
| | \({\bar{p}}^{C}_{new,sold}\) | 5.09 | 4.83 | 5.53 | 4.94 | 5.63 | 5.23 | 4.33 | 5.02 | 4.61 | 4.01 | 4.95 |
| | \({\bar{p}}^{RL}_{used,sold}\) | 3.61 | 4.04 | 4.01 | 3.81 | 3.89 | 3.65 | 3.16 | 3.61 | 3.70 | 3.26 | 3.76 |
| | \({\bar{p}}^{C}_{used,sold}\) | 3.07 | 3.61 | 3.37 | 3.33 | 3.44 | 3.11 | 2.51 | 3.19 | 2.84 | 2.63 | 2.88 |
| | \({\bar{p}}^{RL}_{rebuy,sold}\) | 0.39 | 0.00 | 0.20 | 0.00 | 0.64 | 0.49 | 0.16 | 0.34 | 0.59 | 0.79 | 0.58 |
| | \({\bar{p}}^{C}_{rebuy,sold}\) | 0.92 | 0.89 | 1.05 | 0.87 | 1.04 | 0.95 | 0.87 | 1.29 | 1.31 | 1.38 | 2.00 |
| Sales | \({\bar{X}}^{RL}_{new}\) | 3.96 | 2.42 | 4.74 | 4.44 | 4.04 | 3.40 | 4.14 | 2.84 | 2.72 | 1.86 | 3.74 |
| | \({\bar{X}}^{C}_{new}\) | 6.52 | 3.78 | 9.88 | 6.74 | 5.60 | 6.76 | 6.24 | 4.30 | 4.32 | 3.62 | 6.74 |
| | \({\bar{X}}^{RL}_{used}\) | 1.72 | 0.62 | 2.50 | 1.46 | 1.98 | 1.88 | 2.42 | 1.24 | 0.88 | 1.12 | 2.06 |
| | \({\bar{X}}^{C}_{used}\) | 3.63 | 0.98 | 5.14 | 3.00 | 3.42 | 3.82 | 4.20 | 3.34 | 2.90 | 2.06 | 3.56 |
| | \({\bar{X}}^{RL}_{rebuy}\) | 1.74 | 0.68 | 2.54 | 1.44 | 2.08 | 1.78 | 2.34 | 1.18 | 0.84 | 1.14 | 2.06 |
| | \({\bar{X}}^{C}_{rebuy}\) | 3.62 | 1.04 | 5.20 | 3.00 | 3.42 | 3.74 | 4.12 | 3.34 | 2.84 | 2.12 | 7.92 |
| Resource flows, stocks & rewards | \({\bar{N}}_{in\,use}\) | 258 | 156 | 349 | 265 | 232 | 482 | 173 | 184 | 238 | 293 | 244 |
| | \({\bar{N}}_{garbage}\) | 5.11 | 4.46 | 7.20 | 6.84 | 4.16 | 4.84 | 3.90 | 0.62 | 0.18 | 0.18 | 0.82 |
| | \({\bar{N}}_{virgin}\) | 10.50 | 6.31 | 14.56 | 11.18 | 9.64 | 10.16 | 10.38 | 7.14 | 7.04 | 5.48 | 10.48 |
| | \({\bar{N}}^{RL}_{stock}\) | 8.77 | 4.24 | 13.56 | 8.24 | 8.04 | 22.04 | 11.46 | 11.42 | 3.92 | 5.04 | 6.42 |
| | \({\bar{N}}^{C}_{stock}\) | 8.35 | 8.06 | 9.28 | 8.78 | 7.98 | 8.74 | 8.64 | 6.02 | 5.60 | 5.24 | 29.88 |
| | \({\bar{G}}^{RL}_{reward}\) | 15.60 | 8.94 | 24.72 | 21.96 | 15.89 | 14.54 | 15.30 | 11.45 | 9.66 | 5.81 | 15.74 |
| | \({\bar{G}}^{C}_{reward}\) | 16.91 | 8.00 | 28.93 | 24.40 | 12.80 | 18.38 | 9.48 | 6.56 | 6.43 | 3.12 | 11.69 |
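The per-step averages reported above can be obtained by running evaluation episodes for each parameter variant and averaging the logged quantities over steps and episodes. The sketch below illustrates one possible way to do this; `run_episode`, `make_env`, and `evaluate` are hypothetical placeholders, and the burn-in length is an assumption motivated by the observation that steady states are reached after a few hundred periods.

```python
# Illustrative sketch: average logged per-step metrics for each ablation variant.
# All function names and the burn-in length are assumptions for this example.
import numpy as np


def steady_state_averages(run_episode, n_episodes=50, burn_in=100):
    """Average logged metrics over steps, discarding an initial burn-in phase."""
    totals = {}
    for _ in range(n_episodes):
        logs = run_episode()                        # dict: metric name -> per-step values
        for name, values in logs.items():
            tail = np.asarray(values)[burn_in:]     # keep only (near-)steady-state steps
            totals.setdefault(name, []).append(tail.mean())
    return {name: float(np.mean(means)) for name, means in totals.items()}


# Single-parameter deviations from the base case, as in the table above.
variants = {
    "base": {},
    "B=10": {"n_customers": 10},
    "B=30": {"n_customers": 30},
    "c_virgin=2": {"c_virgin": 2},
    "c_virgin=4": {"c_virgin": 4},
    "w=0.025": {"resale_share": 0.025},
    "w=0.075": {"resale_share": 0.075},
    "K=3": {"n_firms": 3},
    "K=4": {"n_firms": 4},
    "K=5": {"n_firms": 5},
}
# for name, overrides in variants.items():
#     env = make_env(**overrides)                  # hypothetical environment factory
#     print(name, steady_state_averages(lambda: evaluate(agent, env)))
```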
4.7.1 Remark 1
5 Calibrating Environments from Observable Data
5.1 Using Synthetic Test Environments for Pre-training
5.2 Test Example for the Base Case
5.2.1 Fitting Sales Probabilities
5.2.2 Fitting Competitor’s Price Reactions
5.2.3 Training an Agent on the Fitted Environment B
5.2.4 Evaluation of the Trained Agent on the Original Environment A
| | | Base case | Trained agent B applied on Env. A | (ratio) | Agent B trained on Env. B | (ratio) |
|---|---|---|---|---|---|---|
| Offer prices | \({\bar{p}}^{RL}_{new,offer}\) | 6.12 | 5.52 | (0.90) | 5.34 | (0.87) |
| | \({\bar{p}}^{C}_{new,offer}\) | 5.12 | 4.57 | (0.89) | 4.57 | (0.89) |
| | \({\bar{p}}^{RL}_{used,offer}\) | 3.92 | 3.94 | (1.01) | 4.03 | (1.03) |
| | \({\bar{p}}^{C}_{used,offer}\) | 3.34 | 3.19 | (0.95) | 2.44 | (0.73) |
| | \({\bar{p}}^{RL}_{rebuy,offer}\) | 0.23 | 0.02 | (0.11) | 0.03 | (0.15) |
| | \({\bar{p}}^{C}_{rebuy,offer}\) | 0.72 | 0.61 | (0.84) | 0.25 | (0.34) |
| Sales prices | \({\bar{p}}^{RL}_{new,sold}\) | 6.01 | 5.42 | (0.90) | 5.19 | (0.86) |
| | \({\bar{p}}^{C}_{new,sold}\) | 5.09 | 4.57 | (0.90) | 4.55 | (0.89) |
| | \({\bar{p}}^{RL}_{used,sold}\) | 3.61 | 3.79 | (1.05) | 3.94 | (1.09) |
| | \({\bar{p}}^{C}_{used,sold}\) | 3.07 | 2.84 | (0.93) | 2.44 | (0.79) |
| | \({\bar{p}}^{RL}_{rebuy,sold}\) | 0.39 | 0.04 | (0.09) | 0.05 | (0.12) |
| | \({\bar{p}}^{C}_{rebuy,sold}\) | 0.92 | 0.82 | (0.89) | 0.25 | (0.27) |
| Sales | \({\bar{X}}^{RL}_{new}\) | 3.96 | 4.50 | (1.14) | 3.26 | (0.82) |
| | \({\bar{X}}^{C}_{new}\) | 6.52 | 6.98 | (1.07) | 5.00 | (0.77) |
| | \({\bar{X}}^{RL}_{used}\) | 1.72 | 1.38 | (0.80) | 2.62 | (1.52) |
| | \({\bar{X}}^{C}_{used}\) | 3.63 | 3.62 | (1.00) | 3.14 | (0.87) |
| | \({\bar{X}}^{RL}_{rebuy}\) | 1.74 | 1.58 | (0.91) | 2.68 | (1.54) |
| | \({\bar{X}}^{C}_{rebuy}\) | 3.62 | 3.62 | (1.00) | 3.14 | (0.87) |
| Resource flows, stocks & rewards | \({\bar{N}}_{in\,use}\) | 258 | 277 | (1.08) | 219 | (0.85) |
| | \({\bar{N}}_{garbage}\) | 5.11 | 6.58 | (1.29) | 2.54 | (0.50) |
| | \({\bar{N}}_{virgin}\) | 10.50 | 11.48 | (1.09) | 8.26 | (0.79) |
| | \({\bar{N}}^{RL}_{stock}\) | 8.77 | 9.12 | (1.04) | 5.64 | (0.64) |
| | \({\bar{N}}^{C}_{stock}\) | 8.35 | 9.40 | (1.13) | 1.58 | (0.19) |
| | \({\bar{G}}^{RL}_{reward}\) | 15.60 | 14.70 | (0.94) | 16.75 | (1.07) |
| | \({\bar{G}}^{C}_{reward}\) | 16.91 | 14.20 | (0.84) | 13.53 | (0.80) |
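The parenthesized values report each metric relative to its base-case value, up to rounding (e.g., 5.52 / 6.12 ≈ 0.90 for the agent's average new-offer price). A small helper reproducing these ratios, with hypothetical metric keys:

```python
# The parenthesized table entries are ratios of each metric to the base case,
# e.g. 5.52 / 6.12 ≈ 0.90. Metric keys below are illustrative placeholders.
def relative_to_base(base: dict, other: dict) -> dict:
    """Ratio of each metric to its base-case value, rounded to two decimals."""
    return {name: round(other[name] / base[name], 2) for name in base if base[name] != 0}


base = {"p_RL_new_offer": 6.12, "p_C_new_offer": 5.12}
env_a = {"p_RL_new_offer": 5.52, "p_C_new_offer": 4.57}
print(relative_to_base(base, env_a))   # {'p_RL_new_offer': 0.9, 'p_C_new_offer': 0.89}
```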
6 Discussion
6.1 Main Insights
- RL algorithms can successfully be applied to complex recommerce markets with unknown underlying dynamics regarding consumers' and competitors' behaviors.
- RL agents are able to clearly outperform commonly established rule-based agents.
- In our experiments, at most a few thousand episodes were necessary to train the agents.
- PPO and SAC performed best in both duopoly and oligopoly scenarios.
- The default hyperparameters of the RL algorithms worked well; hardly any tuning was necessary.
- Steady states of the controlled markets are reached after a few hundred periods.
- The non-observability of the number of resources in use and of the competitors' inventories is not critical; results hardly depend on whether they are part of the state space.
- Self-play allows finding robust pricing strategies that are effective against different competitor strategies, including ones not seen in training.
- Our numerical examples show that changes in the parameters or the setup lead to well-behaved and plausible solutions, which verifies the general applicability of the model.
- Agents can be successfully applied to incompletely known markets by pre-training them on auxiliary markets that are calibrated based on realized market data of the (hidden) target market.