1 Introduction
2 Background
3 The GrandStaff dataset
3.1 Ground-truth encoding
The symbol (clefG2) denotes a treble clef on the second line of the music staff, and the symbol (8cc#) indicates that the note has a duration of an eighth note (8), has a pitch of C5 (cc), and comes with an accidental sharp (#), which raises the pitch of the note by one semitone. Thanks to its compactness, which eases score-representation alignment during transcription, and its compatibility with other music encodings and tools, the kern format is a well-suited choice for end-to-end OMR approaches.
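The decomposition of a note token described above can be sketched in a few lines of Python. This is only an illustration: the helper name `parse_kern_note` and the regular expression are assumptions introduced here, and they cover just the simple case of digits, a repeated pitch letter, and an optional accidental. Dots, rests, ties, and beam marks are ignored.

```python
import re

# Hypothetical helper for illustration only; not a full kern parser.
# Matches: duration digits, one pitch letter repeated n times, optional accidental.
NOTE_RE = re.compile(
    r"^(?P<duration>\d+)"
    r"(?P<pitch>(?P<letter>[a-gA-G])(?P=letter)*)"
    r"(?P<accidental>#+|-+|n)?"
)


def parse_kern_note(token):
    """Split a simple kern note token into duration, pitch, and accidental."""
    match = NOTE_RE.match(token)
    if match is None:
        raise ValueError(f"not a simple kern note: {token!r}")
    letters = match.group("pitch")
    # Lowercase letters start at the octave of middle C (c = C4) and each
    # repetition goes one octave up; uppercase letters start at C3 and each
    # repetition goes one octave down.
    if letters[0].islower():
        octave = 3 + len(letters)
    else:
        octave = 4 - len(letters)
    return {
        "duration": int(match.group("duration")),
        "pitch": letters[0].upper() + str(octave),
        "accidental": match.group("accidental") or "",
    }


print(parse_kern_note("8cc#"))
```

For the token 8cc#, the sketch reports an eighth-note duration, the pitch C5, and a sharp accidental, matching the reading given in the text.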
For example, the same note can be encoded as 8e-J, denoting an eighth note (‘8’) of pitch E (‘e’) altered by a flat accidental (‘−’), but also as J8e-.
The symbol 8e-J, despite having originally been encoded as J8e-, is encoded only as 8.e.-.J in bekern format.
3.2 Dataset building process
Statistic | kern | bekern |
---|---|---|
Max. sequence length | 1276 | 1716 |
Min. sequence length | 32 | 34 |
Avg. sequence length | 240 ± 107 | 367 ± 169 |
Unique tokens | 20,575 | 188 |
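The much smaller bekern vocabulary in the table above is consistent with the 8.e.-.J example from Section 3.1: tokens are broken into a fixed order of components, so different orderings of the same marks collapse to one sequence. The snippet below is a minimal sketch of such a canonicalisation. The function name `to_bekern`, the component order (duration, pitch, accidental, remaining marks), and the use of `.` as the separator are assumptions made for illustration; the actual conversion rules may differ, and dotted durations are not handled.

```python
import re

SEP = "."  # separator as written in the 8.e.-.J example (assumption)
PITCH_LETTERS = "abcdefgABCDEFG"
ACCIDENTALS = "#-n"


def to_bekern(token):
    """Hypothetical canonical split of a simple kern note token."""
    duration = "".join(re.findall(r"\d", token))
    pitch = "".join(ch for ch in token if ch in PITCH_LETTERS)
    accidental = "".join(ch for ch in token if ch in ACCIDENTALS)
    # Everything else (beam marks such as L and J, ties, ...) goes last.
    others = "".join(
        ch
        for ch in token
        if not ch.isdigit() and ch not in PITCH_LETTERS and ch not in ACCIDENTALS
    )
    parts = [p for p in (duration, pitch, accidental, others) if p]
    return SEP.join(parts)


# Both orderings of the same note map to a single component sequence.
print(to_bekern("8e-J"))
print(to_bekern("J8e-"))
```

Because every component is its own vocabulary entry, the number of distinct entries grows with the number of distinct durations, pitches, and marks rather than with the number of their combinations.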
Statistic | GrandStaff | Camera GrandStaff |
---|---|---|
Max. width | 3056 | 4048 |
Max. height | 256 | 256 |
Min. width | 143 | 164 |
Min. height | 256 | 256 |
Avg. width | 783 | 1047 |
Avg. height | 256 | 256 |
4 Neural approach
4.1 End-to-end OMR
4.2 The challenge of polyphony
4.3 End-to-end polyphony transcription
4.3.1 Aligning polyphonic scores with their music representation
4.3.2 Score unfolding approach
5 Experiments
5.1 Implementations considered
5.1.1 Recurrent neural network
5.1.2 The transformer
5.1.3 Encoder-only model
5.2 Sequence codification
5.3 Evaluation procedure
5.4 Metrics
Encoding | Model | GrandStaff CER | GrandStaff SER | GrandStaff LER | Camera GrandStaff CER | Camera GrandStaff SER | Camera GrandStaff LER |
---|---|---|---|---|---|---|---|
kern | FCN | 14.6 | 23.9 | 67.9 | 20.6 | 30.2 | 69.0 |
kern | CRNN | 5.0 | 7.3 | 23.2 | 7.2 | 9.9 | 29.5 |
kern | CNNT | 7.9 | 11.1 | 32.4 | 9.4 | 12.3 | 33.3 |
kern-sp | FCN | 6.4 | 11.3 | 29.8 | 11.9 | 22.5 | 58.3 |
kern-sp | CRNN | 5.0 | 9.2 | 25.9 | 5.8 | 10.4 | 27.9 |
kern-sp | CNNT | 5.1 | 7.8 | 21.4 | 5.8 | 10.3 | 27.1 |
bekern | FCN | 8.1 | 12.1 | 35.3 | 23.6 | 28.3 | 70.8 |
bekern | CRNN | 6.1 | 9.1 | 23.4 | 9.6 | 13.0 | 34.1 |
bekern | CNNT | 3.9 | 5.8 | 16.3 | 4.6 | 6.5 | 17.5 |
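To make the three columns concrete, the sketch below computes error rates of this kind on a toy example. It assumes that CER, SER, and LER are edit distances normalised by the reference length and expressed as percentages, computed over characters, whitespace-separated symbols, and lines respectively. The tokenisation functions and the names `edit_distance` and `error_rate` are illustrative assumptions, not the evaluation code behind the table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit distance over total reference length, as a percentage."""
    errors = sum(edit_distance(tokenize(r), tokenize(h))
                 for r, h in zip(refs, hyps))
    length = sum(len(tokenize(r)) for r in refs)
    return 100.0 * errors / length


# Assumed tokenisations for the three granularities.
def chars(text):
    return list(text)


def symbols(text):
    return text.split()


def lines(text):
    return text.split("\n")


reference = "8cL 8C\n8eJ 8c 8G"
prediction = "8cL 8C\n8eJ 8cc 8G"  # one extra character in one symbol

for name, tok in (("CER", chars), ("SER", symbols), ("LER", lines)):
    print(name, error_rate([reference], [prediction], tok))
```

A single wrong character affects one symbol and one whole line, so the same mistake weighs more at coarser granularities. This is in line with LER being the largest value in every row of the table.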
Original | | | Prediction | | |
---|---|---|---|---|---|
**kern | **kern | | **kern | **kern | |
*clefF4 | *clefG2 | | *clefF4 | *clefG2 | |
*k[] | *k[] | | *k[] | *k[] | |
*M2/4 | *M2/4 | | *M2/4 | *M2/4 | |
=- | =- | | =- | =- | |
8cL 8C | 8eeL 8cc | | 8cL 8C | 8eeL 8cc | |
8eJ 8c 8G | 16eeL 16cc 16a | | 8eJ 8cc 8G | 16eeL 16cc 16a | |
. | 16eeJJ[ 16cc[ 16g[ | | . | 16eeJJ[ 16cc[ 16g[ | |
8GL 8GG | 16eeLL] 16cc] 16g] | | 8GL 8GG | 16eeLL] 16cc] 16g] | |
. | 16eeJ 16cc 16a | | . | 16eeJ 8 8e | |
8eJ 8c 8G | 8eeJ 8cc 8g | | 8eJ 8 | 8eeJ 8cc 8g | |
= | = | | = | = | |
8dL 8D | 8ffL 8b | | 8dL 8D | 8ffL 8b | |
8fJ 8B 8G | 16ffL 16b 16a | | 8fJ 8B 8G | 16ffL 16b 16g | |
. | 16ffJJ[ 16b[ 16g[ | | . | 16ffJJ[ 16b[ 16g[ | |
8GL 8GG | 16ffLL] 16b] 16g] | | 8GL 8GG | 16ffLL] 16b] 16g] | |
. | 16ffJ 16a | | . | 16ffJ 16a | |
8fJ 8B 8G | 8ffJ 8g | | 8fJ 8B 8G | 8ffJ 8g | |
= | = | | = | = | |
8cL 8C | 8eeL 8cc | | 8cL 8C | 8eeL 8cc | |
8eJ 8c 8G | 16ffL 16cc 16a | | 8eJ 8c 8G | 16ffL 16cc 16a | |
. | 16eeJJ[ 16cc[ 16g[ | | . | 16eeJJ[ 16cc[ 16g[ | |
8eL 8c 8G | 8eeL] 8cc] 8g] | | 8eL 8c 8G | 8eeL] 8cc] 8g] | |
8GJ 8GG | 8ggJ 8dd 8b | | 8GJ 8GG | 8ggJ 8dd 8b | |
= | = | | = | = | |
*^ | * | | *^ | * | |
4c | 8cL 8C | 16ccLL | 4c | 8cL | 16ccLL |
. | . | 16ee | . | . | 16ee |
. | 8eJ 8c 8G | 16gg | . | 8eJ 8c 8G | 16gg |
. | . | 16cccJJ | . | . | 16cccJJ |
4B- | 8B-L 8BB- | 16ccLL | 4B- | 8B-L 8BB- | 16ccLL |
. | . | 16ee | . | . | 16ee |
. | 8eJ 8c 8G | 16gg | . | 8eJ 8c 8G | 16gg |
. | . | 16cccJJ | . | . | 16cccJJ |
= | = | = | = | = | = |
*v | *v | * | *v | *v | * |
*- | *- | | *- | *- | |
6 Results
6.1 Evaluation on monophonic scores
Architecture | Reshape method | SER |
---|---|---|
FCN | Vertical collapse | 6 |
FCN | Unfolding | 7.8 |
CRNN | Vertical collapse | 3.3 |
CRNN | Unfolding | 4.8 |
CNNT | Vertical collapse | 9.8 |
CNNT | Unfolding | 10.4 |
State of the art [46] | Vertical collapse | 4.7 |
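The two reshape methods compared in the table can be illustrated on a tiny feature map. The sketch below gives one plausible reading of them. It assumes that vertical collapse turns each image column into a single frame by stacking all channels and rows, and that unfolding (Section 4.3.2) reads the feature map row by row, with each position becoming its own frame. The function names and the nested-list representation are assumptions chosen for clarity, not the implementation evaluated here.

```python
def vertical_collapse(fmap):
    """(channels, height, width) -> width frames of size channels * height."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[k][y][x] for k in range(c) for y in range(h)]
            for x in range(w)]


def unfold(fmap):
    """(channels, height, width) -> height * width frames of size channels.

    Rows are concatenated one after another, so the sequence first covers
    the top row from left to right, then the next row, and so on.
    """
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[k][y][x] for k in range(c)]
            for y in range(h) for x in range(w)]


# Toy feature map: 2 channels, 2 rows, 3 columns.
# Each value encodes its position as 100 * channel + 10 * row + column.
fmap = [[[100 * k + 10 * y + x for x in range(3)] for y in range(2)]
        for k in range(2)]

collapsed = vertical_collapse(fmap)
unfolded = unfold(fmap)
print(len(collapsed), len(collapsed[0]))
print(len(unfolded), len(unfolded[0]))
```

Under these assumptions, vertical collapse yields a short sequence of wide frames, while unfolding yields a sequence that is longer by a factor equal to the feature-map height, with each frame holding the channel values of a single position.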